# spicedb
t
I have a question regarding the performance of `CheckPermission`: I'm testing this schema with some Python scripts (SpiceDB + Postgres running in Docker): https://play.authzed.com/s/NVTEhOOCjAPS/schema The test data is the same as in the playground, but I also added 1,000 "knowledgebases" to object "org1" with relation "owner_org" and 999,000 "knowledgebases" to object "org2" with relation "owner_org". (This was done to test the performance of LookupResources.) I'm running the 12 permission checks defined in the playground and I'm surprised by the variation in execution time.
Copy code
start_time_ms = time.time() * 1000
resp = client.CheckPermission(
    CheckPermissionRequest(
        #consistency=consistency,
        resource=kb1,
        permission="view",
        subject=user1,
    )
)
end_time_ms = time.time() * 1000
print(f"Duration {end_time_ms - start_time_ms} ms")
...
The lowest execution time I've seen was < 1 ms, the highest was > 100 ms. These are my last 2 runs (same order of checks):
Copy code
Duration 7.9990234375 ms
Duration 1.00146484375 ms
Duration 47.999267578125 ms
Duration 98.03125 ms
Duration 50.020263671875 ms
Duration 54.95703125 ms
Duration 44.98779296875 ms
Duration 50.00244140625 ms
Duration 49.06201171875 ms
Duration 51.076171875 ms
Duration 49.76171875 ms
Duration 49.468505859375 ms
Copy code
Duration 10.9677734375 ms
Duration 2.00048828125 ms
Duration 1.001220703125 ms
Duration 3.025634765625 ms
Duration 0.965576171875 ms
Duration 0.977294921875 ms
Duration 46.08935546875 ms
Duration 0.98779296875 ms
Duration 1.041259765625 ms
Duration 47.974365234375 ms
Duration 52.947265625 ms
Duration 3.998779296875 ms
Are those deviations expected? Or does that have something to do with my setup? I thought the same check would take about the same amount of time each run if nothing changed. I also thought using consistency=Consistency(fully_consistent=True) would push those durations to ~50 ms or more every time, but I did not notice any difference.
v
This is expected. Data is cached in memory for a given amount of time, defined by the so-called quantization window. If the subproblems needed are not part of the current quantization window, the server has to go and recompute them. The same applies to fully-consistent: it also uses the cache if the head revision is in memory. The only difference is how the revision is computed; if no writes have taken place, there is no new revision.
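To illustrate, the consistency options on the client side look roughly like this (just a sketch; the request variables are placeholders):
Copy code
from authzed.api.v1 import (
    CheckPermissionRequest,
    Consistency,
    ObjectReference,
    SubjectReference,
)

kb1 = ObjectReference(object_type="knowledgebase", object_id="kb1")
user1 = SubjectReference(object=ObjectReference(object_type="user", object_id="user1"))

# minimize_latency (the default behaviour): reuse the revision picked for the
# current quantization window, so repeated subproblems can be served from cache.
req = CheckPermissionRequest(
    consistency=Consistency(minimize_latency=True),
    resource=kb1, permission="view", subject=user1,
)

# fully_consistent: always evaluate at the head revision. If no writes happened,
# the head revision is unchanged and cached subproblems are still used, which is
# why the timings barely differ.
req = CheckPermissionRequest(
    consistency=Consistency(fully_consistent=True),
    resource=kb1, permission="view", subject=user1,
)

# at_least_as_fresh: evaluate at or after a ZedToken returned by a previous
# write/check (not used in your test, shown for completeness).
# req = CheckPermissionRequest(consistency=Consistency(at_least_as_fresh=token), ...)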
Are those checks supposed to go through the path with the 999K elements?
Or are you testing a different part of the graph?
t
> Data is cached in memory for a given amount of time, defined by the so-called quantization window.
That's 5 seconds by default, right? When I do the same checks 3 times, performance does not seem to improve, though.
Copy code
Started at 2024-02-21 14:44:13.707322
Duration 9.001953125 ms
Duration 1.998779296875 ms
Duration 1.00048828125 ms
Duration 3.001220703125 ms
Duration 0.998046875 ms
Duration 48.5283203125 ms
Duration 50.00048828125 ms
Duration 51.132080078125 ms
Duration 49.356201171875 ms
Duration 50.0 ms
Duration 51.000244140625 ms
Duration 49.0947265625 ms

Started at 2024-02-21 14:44:14.071434
Duration 1.958984375 ms
Duration 47.00390625 ms
Duration 50.943603515625 ms
Duration 100.94140625 ms
Duration 49.697509765625 ms
Duration 50.099853515625 ms
Duration 51.009033203125 ms
Duration 48.994140625 ms
Duration 50.000732421875 ms
Duration 51.232666015625 ms
Duration 48.90966796875 ms
Duration 49.99853515625 ms

Started at 2024-02-21 14:44:14.621280
Duration 49.306884765625 ms
Duration 51.226806640625 ms
Duration 49.12548828125 ms
Duration 100.440185546875 ms
Duration 48.993896484375 ms
Duration 50.0966796875 ms
Duration 50.278076171875 ms
Duration 48.32568359375 ms
Duration 52.20556640625 ms
Duration 48.078125 ms
Duration 51.133056640625 ms
Duration 48.33154296875 ms
The checks are in this order:
Copy code
1. resource=kb1, permission="view", subject=user1
2. resource=kb1, permission="view", subject=user2
3. resource=kb1, permission="view", subject=user3
4. resource=kb1, permission="view", subject=admin1
5. resource=kb1, permission="edit", subject=user1
6. resource=kb1, permission="edit", subject=user2
7. resource=kb1, permission="edit", subject=user3
8. resource=kb1, permission="edit", subject=admin1
9. resource=org1, permission="membership", subject=user1
10. resource=org1, permission="membership", subject=user2
11. resource=org1, permission="membership", subject=user3
12. resource=org1, permission="membership", subject=admin1
The last 4 checks should have nothing to do with the 1,000,000 knowledgebases assigned to org1/org2.
Copy code
definition organization {
    relation admin: user
    relation member: user
    permission membership = admin + member
}
But as you can see, they still take 50+ ms a lot of the time.
v
There may be something off happening for wide relations. I'd suggest visualizing the OpenTelemetry traces to see what's going on. You can also do a zed backup create of your instance and send it our way.
t
I'm trying to set up tracing. Just want to mention the links to "global flags" here are broken: https://authzed.com/docs/spicedb/ops/observability
I think I have to set --otel-endpoint="", --otel-insecure and --otel-provider="otlpgrpc"?
v
correct
t
Got it to work like this:
--otel-endpoint="jaeger:4317" --otel-provider="otlpgrpc" --otel-sample-ratio 1 --otel-insecure
Copy code
Started at 2024-02-22 11:09:02.202036
Duration 51.400146484375 ms
Duration 49.38134765625 ms
Duration 49.435546875 ms
Duration 98.85498046875 ms
Duration 50.096923828125 ms
Duration 49.4140625 ms
Duration 50.287353515625 ms
Duration 49.98583984375 ms
Duration 50.41748046875 ms
Duration 50.361328125 ms
Duration 48.8427734375 ms
Duration 51.60595703125 ms
Those durations don't really match what's shown in the traces (everything at 0.x ms). Do you know where that additional delay could be coming from? Is it just network? But everything is running locally on the same machine in Docker containers... https://cdn.discordapp.com/attachments/1209844584520097812/1210178733160534107/grafik.png?ex=65e99dc6&is=65d728c6&hm=04566c7170e827f7ff00c1ba98a997e2c5333d78c533cf6b2392584e1a08d0e6&
v
Are you observing a trace of CheckPermission? Or is that perhaps something else?
t
I noticed the HTTP API is a lot faster than gRPC in Python. Python code:
Copy code
import time
from datetime import datetime

from authzed.api.v1 import (
    CheckPermissionRequest,
    Consistency,
    ObjectReference,
    SubjectReference,
)

# `client` is the authzed.api.v1.Client created earlier (connection setup omitted)
for x in range(10):
    #print(f"Started at {datetime.now()}")

    start_time_ms = time.time() * 1000
    resp = client.CheckPermission(
        CheckPermissionRequest(
            consistency=Consistency(minimize_latency=True),
            resource=ObjectReference(object_type="knowledgebase", object_id="kb1"),
            permission="view",
            subject=SubjectReference(object=ObjectReference(object_type="user", object_id="user1"))
        )
    )
    end_time_ms = time.time() * 1000
    print(f"Duration {end_time_ms - start_time_ms} ms")
Python logs:
Copy code
Duration 9.5419921875 ms
Duration 4.2685546875 ms
Duration 49.626708984375 ms
Duration 51.244140625 ms
Duration 49.075927734375 ms
Duration 50.07470703125 ms
Duration 49.827880859375 ms
Duration 49.335205078125 ms
Duration 49.7177734375 ms
Duration 49.9755859375 ms
gRPC in Python takes about 50 ms most of the time. HTTP API request in Postman:
http://localhost:8443/v1/permissions/check
Copy code
{
  "consistency": {
    "minimizeLatency": true
  },
  "resource": {
    "objectType": "knowledgebase",
    "objectId": "kb1"
  },
  "permission": "view",
  "subject": {
    "object": {
      "objectType": "user",
      "objectId": "user1"
    }
  }
}
HTTP API requests take between 4-7 ms most of the time.
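For reference, the same HTTP call from Python looks roughly like this (sketch; the preshared key is a placeholder):
Copy code
import requests

resp = requests.post(
    "http://localhost:8443/v1/permissions/check",
    headers={"Authorization": "Bearer somerandomkey"},  # preshared key: placeholder
    json={
        "consistency": {"minimizeLatency": True},
        "resource": {"objectType": "knowledgebase", "objectId": "kb1"},
        "permission": "view",
        "subject": {"object": {"objectType": "user", "objectId": "user1"}},
    },
)
print(resp.json()["permissionship"])  # e.g. "PERMISSIONSHIP_HAS_PERMISSION"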
d
We're also using Python + gRPC against a self-hosted SpiceDB cluster in k8s, and we're having a frustrating time. Performance for us is mostly very fast, but we have been having gRPC issues against the self-hosted cluster. Interestingly, not against the serverless offering: https://discord.com/channels/844600078504951838/844600078948630559/1209441355655544902 Interesting to see that the HTTP API is more efficient for you. Performance is acceptable for us as is, but we might take a look at switching. The SpiceDB logs indicate ~2-10 ms processing times for most of our queries. Round-trip time from Python is definitely much larger than that, but still a small fraction of the time our Python API takes to gather metadata and respond.
t
I've written a small Python script comparing the performance of gRPC and HTTP. HTTP is a lot faster. Maybe something is wrong with the gRPC client? Output:
Copy code
--- START GRPC ---
Duration 12.93017578125 ms
Duration 1.581787109375 ms
Duration 41.468017578125 ms
Duration 49.78662109375 ms
Duration 50.671142578125 ms
Duration 50.17919921875 ms
Duration 50.543212890625 ms
Duration 50.43701171875 ms
Duration 49.665771484375 ms
Duration 51.866455078125 ms
--- END GRPC ---
--- START HTTP ---
Duration 11.78662109375 ms
Duration 4.828125 ms
Duration 6.3837890625 ms
Duration 4.811279296875 ms
Duration 3.78759765625 ms
Duration 4.275146484375 ms
Duration 5.304443359375 ms
Duration 4.394775390625 ms
Duration 6.509521484375 ms
Duration 3.7783203125 ms
--- END HTTP ---
https://cdn.discordapp.com/attachments/1209844584520097812/1210202026361888768/grpc_vs_http.py?ex=65e9b377&is=65d73e77&hm=ec5f28997eabfe6690a78d9afd03b592a2bde90b2b367c6196a76e2f5af02179&
v
That does not make sense. The HTTP API is literally hitting the gRPC API internally.
Maybe something is broken about the Python gRPC client?
Can you try with something like grpcurl?
Which version of the authzed-py client are you using?
t
Copy code
pip show authzed
Name: authzed
Version: 0.13.0
Summary: Client library for SpiceDB.
Home-page:
Author: Authzed
Author-email: support@authzed.com
License:
Location: c:\users\...\appdata\local\programs\python\python310\lib\site-packages
Requires: google-api-core, google_api, grpcio, protobuf, typing-extensions
Required-by:
Seems to be the newest version: https://pypi.org/project/authzed/
v
@torben_26072 would you mind opening an issue in authzed-py with your findings? @Duncan feel free to add any information you may have there too. @torben_26072 if you can export with zed backup create, upload it somewhere and reference it in the issue, that'd be super useful.
t
Will do that.
v
If you configured your client with OpenTelemetry, propagated the trace context, and captured traces from it too, we would be able to see where the time is being spent. Did you check the gRPC latencies reported by the SpiceDB logs? Do they show the same latencies as the client?
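Something roughly like this on the client side would do it (sketch; assumes the opentelemetry-sdk, opentelemetry-exporter-otlp and opentelemetry-instrumentation-grpc packages are installed, and that Jaeger's OTLP gRPC port 4317 is reachable from the host):
Copy code
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.grpc import GrpcInstrumentorClient
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export client-side spans to the same Jaeger instance SpiceDB reports to,
# so the client-observed latency can be compared against the server-side spans.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# Instrument gRPC channels created after this call (build the authzed Client afterwards).
GrpcInstrumentorClient().instrument()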
t
Windows 10 22H2, Intel Core i7 13700HX
grpc.time_ms is 0.
Copy code
2024-02-22 14:17:33 {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"b9b9c315e73f05c0357710a083e290ad","peer.address":"172.19.0.1:35872","grpc.start_time":"2024-02-22T13:17:33Z","grpc.code":"OK","grpc.time_ms":0,"time":"2024-02-22T13:17:33Z","message":"finished call"}
2024-02-22 14:17:33 {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"7c1d3564adcdf1524b8f868699e0239a","peer.address":"172.19.0.1:35872","grpc.start_time":"2024-02-22T13:17:33Z","grpc.code":"OK","grpc.time_ms":0,"time":"2024-02-22T13:17:33Z","message":"finished call"}
2024-02-22 14:17:33 {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"c4da1ecf85a49e90167785fa4febad5e","peer.address":"172.19.0.1:35872","grpc.start_time":"2024-02-22T13:17:33Z","grpc.code":"OK","grpc.time_ms":0,"time":"2024-02-22T13:17:33Z","message":"finished call"}
Those requests measured ~50 ms in Python.
v
yikes
Are you referencing localhost when initializing the client, or 127.0.0.1?
Just an additional data point: the gRPC benchmarking dashboard shows that the Python implementation is always the slowest. QPS throughput is terrible: https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5180705743044608
d
Sorry to confuse your thread, Torben. We also see lots of grpc.time_ms:0 requests taking much longer by the time Python gets its answer, but we do have a trip over the internet and through k8s via an nginx load balancer. Our overall results are not THAT dissimilar, however: https://cdn.discordapp.com/attachments/1209844584520097812/1210219492316020787/image.png?ex=65e9c3bb&is=65d74ebb&hm=eb36d1e730ea33ff623c5fd6371dc9894c9f49e09d08a6f08ae314eda1332d24& https://cdn.discordapp.com/attachments/1209844584520097812/1210219492676993064/image.png?ex=65e9c3bb&is=65d74ebb&hm=d5288f40d554a4411a4ce7f5569753218ff54e1c8d5ff45fc662c6e0df127264&
t
Oh wow, you found the problem. I changed localhost to 127.0.0.1 and now it looks like this:
Copy code
CHECK_PERMISSION:
--- START GRPC ---
Duration 7.995361328125 ms
Duration 1.653076171875 ms
Duration 2.1259765625 ms
Duration 1.63671875 ms
Duration 1.623291015625 ms
Duration 1.56005859375 ms
Duration 1.61279296875 ms
Duration 1.6015625 ms
Duration 1.08837890625 ms
Duration 1.05615234375 ms
--- END GRPC ---
--- START HTTP ---
Duration 8.16552734375 ms
Duration 4.935546875 ms
Duration 3.897705078125 ms
Duration 4.309814453125 ms
Duration 4.350341796875 ms
Duration 4.359130859375 ms
Duration 3.88623046875 ms
Duration 3.833740234375 ms
Duration 7.128662109375 ms
Duration 3.900634765625 ms
--- END HTTP ---
That's for CheckPermission. For LookupResources the HTTP API is still a lot faster, probably because over HTTP the result (1,000 resources) is not streamed.
Copy code
LOOKUP_RESOURCES:
--- START GRPC ---
Duration 55.7783203125 ms
Duration 37.469482421875 ms
Duration 40.148193359375 ms
Duration 41.048583984375 ms
Duration 44.19091796875 ms
Duration 44.93505859375 ms
Duration 40.215576171875 ms
Duration 41.070556640625 ms
Duration 49.726806640625 ms
Duration 39.05126953125 ms
--- END GRPC ---
--- START HTTP ---
Duration 15.42919921875 ms
Duration 19.097412109375 ms
Duration 14.84326171875 ms
Duration 19.34765625 ms
Duration 17.08740234375 ms
Duration 15.32421875 ms
Duration 20.00341796875 ms
Duration 17.95166015625 ms
Duration 16.96728515625 ms
Duration 17.973876953125 ms
--- END HTTP ---
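For reference, this is roughly how the gRPC stream is consumed on my side (sketch; client setup is the same as above):
Copy code
from authzed.api.v1 import (
    Consistency,
    LookupResourcesRequest,
    ObjectReference,
    SubjectReference,
)

# LookupResources is a server-streaming RPC over gRPC, so the client iterates
# over one message per matching resource (~1,000 of them here).
stream = client.LookupResources(
    LookupResourcesRequest(
        consistency=Consistency(minimize_latency=True),
        resource_object_type="knowledgebase",
        permission="view",
        subject=SubjectReference(object=ObjectReference(object_type="user", object_id="user1")),
    )
)
ids = [item.resource_object_id for item in stream]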
d
oh, nice!
t
For me the question is now: Is there a general performance problem with Python gRPC when using a domain instead of an IP address? It's not always possible to use a static IP address.
@vroldanbet Should I still create an issue for the gRPC performance problems with Python?
v
From a cursory look online, everything points at Python's implementation of gRPC being bad, so I'm not surprised about the Lookup result, since you can read about how proto serialization/deserialization is slow in Python.
Maybe the recommendation moving forward is to tell people not to use the gRPC client with Python
and instead use the HTTP gateway.
t
But that doesn't explain the difference in gRPC performance for CheckPermission between localhost and 127.0.0.1. Maybe something can be changed there?
That depends on the size of the returned result. For 1,000 returned resources HTTP is a lot faster. For 100 returned resources there's not really any difference. For 10 resources gRPC is faster than HTTP.
v
Yeah, that I don't know. I just found a GitHub issue online that said the networking between the host and the Docker subsystem on Windows was slow for whatever reason.
Yeah, as mentioned, proto in Python is kinda slow.
t
It's possible it has something to do with Docker, but localhost:50051 (gRPC) being slow and localhost:8443 (HTTP) being fast for CheckPermission is still strange.
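One thing I can check is what localhost actually resolves to on this machine (diagnostic sketch; the IPv6-first explanation is just a guess on my part):
Copy code
import socket

# Print every address "localhost" resolves to for port 50051.
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 50051):
    print(family, sockaddr)
# Guess: if ::1 (IPv6) is listed first, the gRPC client may try IPv6 before
# falling back to 127.0.0.1, and that fallback could be slow against the
# Docker-published port on Windows/WSL2.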
v
Can you try running the python script inside of the container?
To skip the networking stack
t
Could be something like this: https://github.com/pytorch/serve/issues/1375
Good idea.
I don't really know how to get Python running in the SpiceDB container. There's no package manager or gcc.
I'm now using the Windows version of SpiceDB and the performance of localhost and 127.0.0.1 is similar. So it's most likely a Docker or WSL2 problem:
Copy code
CHECK_PERMISSION:
--- START LOCALHOST ---
Duration 7.581298828125 ms
Duration 1.0810546875 ms
Duration 1.111328125 ms
Duration 1.0556640625 ms
Duration 0.536865234375 ms
Duration 1.054931640625 ms
Duration 1.059326171875 ms
Duration 0.53173828125 ms
Duration 1.0576171875 ms
Duration 1.080078125 ms
--- END LOCALHOST ---
--- START 127.0.0.1 ---
Duration 1.719970703125 ms
Duration 1.083984375 ms
Duration 0.52734375 ms
Duration 1.134033203125 ms
Duration 0.904296875 ms
Duration 0.85009765625 ms
Duration 0.296630859375 ms
Duration 1.37548828125 ms
Duration 0.818115234375 ms
Duration 2.551513671875 ms
--- END 127.0.0.1 ---
Copy code
LOOKUP_RESOURCES:
--- START LOCALHOST ---
Duration 43.968994140625 ms
Duration 34.41650390625 ms
Duration 33.071044921875 ms
Duration 36.738525390625 ms
Duration 37.85009765625 ms
Duration 44.834716796875 ms
Duration 39.364990234375 ms
Duration 39.24169921875 ms
Duration 40.6171875 ms
Duration 35.39599609375 ms
--- END LOCALHOST ---
--- START 127.0.0.1 ---
Duration 37.03466796875 ms
Duration 37.453125 ms
Duration 38.0908203125 ms
Duration 33.736083984375 ms
Duration 31.812255859375 ms
Duration 33.2841796875 ms
Duration 34.0 ms
Duration 30.41259765625 ms
Duration 35.941650390625 ms
Duration 35.09912109375 ms
--- END 127.0.0.1 ---
v
OK, so it's Docker stuff on Windows. Good to know.
Thanks for working through this 🙇‍♂️
Useful information to have.